Multivariate Discretization by Recursive Supervised Bipartition of Graph
Authors
Abstract
In supervised learning, discretizing the continuous explanatory attributes enhances the accuracy of decision tree induction algorithms and of the naive Bayes classifier. Many discretization methods have been developed, leading to precise and comprehensible evaluations of the amount of information contained in a single attribute with respect to the target attribute. In this paper, we discuss the multivariate notion of neighborhood, which extends the univariate notion of interval. We propose an evaluation criterion for bipartitions, based on the Minimum Description Length (MDL) principle [1], and apply it recursively. The resulting discretization method is thus able to exploit correlations between continuous attributes. Its accuracy and robustness are evaluated on real and synthetic data sets.

S. Ferrandiz and M. Boullé. In: P. Perner and A. Imiya (Eds.): MLDM 2005, LNAI 3587, pp. 253–264, 2005. © Springer-Verlag Berlin Heidelberg 2005

1 Supervised Partitioning Problems

In supervised learning, many inductive algorithms are known to produce better models when continuous attributes are discretized. For example, the naive Bayes classifier requires the estimation of probabilities, and continuous explanatory attributes are not easy to handle, as they often take too many different values for a direct estimation of frequencies. To circumvent this, a normal distribution of the continuous values can be assumed, but this hypothesis is not always realistic [2]. The same phenomenon leads rule extraction techniques to build poorer sets of rules. Decision tree algorithms carry out a selection process over nominal attributes and cannot handle continuous ones directly. Discretization of a continuous attribute, which consists in building intervals by merging the values of the attribute, appears to be a good solution to these problems. Thus, as the results are easily interpretable and lead to more robust estimations of the class conditional probabilities, supervised discretization is widely used.

In [2], a taxonomy of discretization methods is proposed, with three dimensions: supervised vs. unsupervised (considering a class attribute or not), global vs. local (evaluating the partition as a whole or locally to two adjacent intervals) and static vs. dynamic (performing the discretizations in a preprocessing step or embedding them in the inductive algorithm). This paper is placed in the supervised context.

The aim of the discretization of a single continuous explanatory attribute is to find a partition of its values which best discriminates the class distributions between groups. These groups are intervals and the evaluation of a partition is based on a compromise: fewer intervals and stronger class discrimination are better. Discrimination can be performed in many different ways. For example,
– ChiMerge [3] applies the chi-square measure to test the independence of the distributions between groups,
– C4.5 [4] uses information measures based on Shannon's entropy to find the most informative partition,
– MDLPC [5] defines a description length measure, following the MDL principle,
– MODL [6] states a prior probability distribution, leading to a Bayesian evaluation of the partitions.

The univariate case does not take into account any correlation between the explanatory attributes and fails to discover jointly defined patterns. This fact is usually illustrated by the XOR problem (cf. Figure 1): the contributions of the axes have to be considered jointly. Many authors have thus introduced a fourth dimension in the preceding taxonomy: multivariate vs. univariate (searching for cut points simultaneously or not), and proposed multivariate methods (see for example [7] and [8]). These aim at improving rule extraction algorithms and build conjunctions of intervals. This means that the considered patterns are parallelepipeds.
This can be a limiting condition, as the underlying structures of the data are not necessarily so box-shaped (cf. Figure 2). We then distinguish these strongly biased multivariate techniques from weakly biased multivariate ones, which consider more generic patterns. This opposition is briefly discussed in [2], where the authors talk about feature space and instance space discretizations respectively.

Fig. 1. The XOR problem: projection on the axes leads to an information loss

Fig. 2. A challenging synthetic dataset for strongly biased multivariate discretization methods

We present in this paper a new discretization method, which is supervised, local, static, multivariate and weakly biased. As for the MDLPC method, an evaluation criterion of a bipartition is defined following the MDL principle and applied recursively.

The remainder of the paper is organized as follows. We first set the notations (section 2). Then, we describe the MDLPC technique (section 3) and our framework (section 4). We propose a new evaluation criterion for bipartitions (section 5) and test its validity on real and synthetic datasets (section 6). Finally, we conclude and point out future work (section 7).
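
The XOR situation mentioned in the introduction can be reproduced numerically. The following Python sketch is an illustration only and is not the MDL criterion proposed in the paper: it assumes points drawn uniformly in the unit square, uses a plain entropy-based information gain as the evaluation measure, and searches cut points on a hypothetical grid (`cuts`). All function names are introduced here for the example. It compares the best single cut on each axis with the joint partition into four quadrants.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy reduction obtained by splitting `labels` according to `groups`."""
    parts = {}
    for group, label in zip(groups, labels):
        parts.setdefault(group, []).append(label)
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - remainder


def make_xor(n, seed=0):
    """Uniform points in the unit square, labeled by an XOR of the two halves."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(n)]
    labels = [int((x > 0.5) != (y > 0.5)) for x, y in points]
    return points, labels


def best_univariate_gain(points, labels, axis, cuts):
    """Best gain of a single cut point on one axis, over a grid of candidates."""
    return max(
        information_gain(labels, [p[axis] > c for p in points]) for c in cuts
    )


points, labels = make_xor(1000)
cuts = [i / 20 for i in range(1, 20)]  # assumed candidate grid: 0.05 .. 0.95

gain_x = best_univariate_gain(points, labels, 0, cuts)
gain_y = best_univariate_gain(points, labels, 1, cuts)
# Joint partition: the four quadrants defined by both attributes at once.
gain_xy = information_gain(labels, [(x > 0.5, y > 0.5) for x, y in points])

print(f"best single cut on x:  {gain_x:.3f} bits")
print(f"best single cut on y:  {gain_y:.3f} bits")
print(f"joint quadrant split:  {gain_xy:.3f} bits")
```

Under these assumptions, no single cut on either axis recovers a meaningful share of the class information, whereas the joint partition separates the classes completely. Note that the quadrants used here are themselves axis-parallel boxes, so this example only contrasts univariate with multivariate evaluation; it does not illustrate the difference between strongly and weakly biased methods.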